This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Cleaning data

Overview

This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.

HOW TO READ: To emphasize the tidyverse coding approach, each cleaning step is explained individually and then incorporated into a “cleaning pipeline” - a series of cleaning actions linked together sequentially through pipes (LINK TO PIPES). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).

The cleaning steps demonstrated include:

  • Loading the data
  • column name cleaning
  • column selection
  • Designating column classes
  • Filtering rows
  • Re-coding values
  • Creating groups (case_when())
  • Dealing with character case (upper, lower, title, etc.)
  • Factor columns

replace missing with dealing with cases (all lower, etc) case_when() factors

Preparation

Load packages

pacman::p_load(tidyverse,  # data manipulation and visualization
               janitor,    # data cleaning
               epitrix     # data cleaning
               )

Load data

Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)

linelist_raw <- import("ebola_simulated.xlsx")

You can view the first 50 rows of the the original “raw” dataset below:

# display the linelist data as a table
DT::datatable(head(linelist_raw,50), rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Cleaning pipeline

In epidemiological analysis and data processing, cleaning steps are often performed together and sequentially. In R this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another. The chain often utilizes dplyr verbs and the magrittr pipe operator (see handbook page on dplyr and tidyverse coding style (LINK HERE).

In a cleaning pipeline the order of the steps is important. Cleaning steps may include:

  • Column names may be cleaned or changed
  • Rows may be filtered or added
  • Columns may be selected, added, transformed, or re-ordered
  • Values may be re-coded, cleaned, or grouped

Column names

Column names are used very often so they need to have “clean” syntax. We suggest the following:

  • Short names
  • No spaces (replaced with underscores (_),
  • No unusual characters (&, #…)
  • Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…)

The columns names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).

Also note that in the raw data, the two final columns names were two merged cells with one name. The import() function used the name for the first of the two columns, and assigned the second column the name “…23” as it was then empty (referring to the 23rd column).

names(linelist_raw)
##  [1] "row_num"         "case_id"         "generation"      "infection date" 
##  [5] "date onset"      "hosp date"       "date_of_outcome" "outcome"        
##  [9] "gender"          "hospital"        "lon"             "lat"            
## [13] "infector"        "source"          "age"             "age_unit"       
## [17] "fever"           "chills"          "cough"           "aches"          
## [21] "vomit"           "merged_column"   "...23"
Note: For a column name that include spaces, surround the name with back-ticks, for example: linelist$`infection date`. On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).

Automatic syntax cleaning

The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:

  • Converts all names to consist of only underscores, numbers, and letters
  • Accented characters are transliterated to ASCII (e.g. german o with umlaut becomes “o”, spanish “enye” becomes “n”)
  • Capitalization preference can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
  • You can designate specific name replacements with the replace = argument (e.g. replace = c(onset = “date_of_onset”))
  • Here is an online vignette

Below, the cleaning pipeline begins by using clean_names() on the raw linelist.

# send the dataset through the function clean_names()
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new names
names(linelist)
##  [1] "row_num"         "case_id"         "generation"      "infection_date" 
##  [5] "date_onset"      "hosp_date"       "date_of_outcome" "outcome"        
##  [9] "gender"          "hospital"        "lon"             "lat"            
## [13] "infector"        "source"          "age"             "age_unit"       
## [17] "fever"           "chills"          "cough"           "aches"          
## [21] "vomit"           "merged_column"   "x23"

NOTE: The column name “…23” was changed to “x23”.

Manual column name cleaning

Re-naming columns manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”, the new column name is given before the old column name.

Below, a re-name command is added to the cleaning pipeline:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the columns names have been changed:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

You can also rename by column position, instead of column name, for example:

rename(newNameForFirstColumn = 1,
       newNameForSecondColumn = 2)

Empty Excel column names

If you importing an Excel sheet with a missing column name, depending on the import function used, R will likely create a column name with a value like “…1” or “…2”. You can clean these names manually by referencing their position number (see above), or their name (linelist_raw$...1).

Merged Excel column names

Merged cells in an Excel file are a common occurrance when receiving data from field level. Merged cells can be nice for human reading of data, but cause many problems for machine reading of data. R cannot accomodate merged cells. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.

When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.

One solution to deal with merged cells is to import the data with the function readWorkbook() from package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.

linelist_raw <- openxlsx::readWorkbook(here("data", "ebola_simulated.xlsx"), fillMergedCells = TRUE)

DANGER: If column names are merged, you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning..

Skip import of column names

Sometimes, you may want to avoid importing a row of data. Using import() from the rio package on a .xlsx file, you can do this with the argument skip =. Provide the number of rows you want to skip.

linelist_raw <- import("ebola_simulated.xlsx", skip = 1)  # does not import header row

Second header row

You may need to avoid importing the second row of data, for example if it is a data dictionary row (as in the example linelist). This can be problematic because it can result in all columns being imported as class “character”. To solve this, you will likely need to import the data twice.

  1. Import the data in order to store the correct column names
  2. Import the data again, skipping the first two rows (header and second rows)
  3. Bind the correct names onto the reduced dataframe

The exact arguments used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). If using rio’s import() function, understand which function rio uses to import your data, and then give the appropriate argument to skip lines and/or designate the column names. See the handbook page on importing data (LINK) for details on rio.

For Excel files:

# REMOVE 2nd ROW (DATA DICTIONARY)
linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)
# For excel files
linelist_raw_names <- import("ebola_simulated.xlsx") %>% names()
linelist_raw <- import("ebola_simulated.xlsx", skip = 2, col_names = linelist_raw_names) # argument is 'col_names'

For CSV files:

# For csv files
linelist_raw_names <- import("ebola_simulated.csv") %>% names()
linelist_raw <- import("ebola_simulated.csv", skip = 2, col.names = linelist_raw_names) # argument is 'col.names'

Backup option - changing column names as a separate command

# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names

Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it using the gather() command from the tidyr package.
source: https://alison.rbind.io/post/2018-02-23-read-multiple-header-rows/

TO DO

library(tidyr)
stickers_dict <- import("ebola_simulated.xlsx") %>% 
  clean_names() %>% 
  gather(variable_name, variable_description)
stickers_dict

Combine two header rows

In some cases, you may want to combine two header rows into one. This command will define the column names as the combination (pasting together) of the existing column names with the value underneath in the first row.

names(df) <- paste(names(df), df[1, ], sep = "_")

Select or re-order columns

CAUTION: This tab may follow from previous tabs.

Often the first step of cleaning data is selecting the columns you want to work with, and to set their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.

CAUTION: In the examples below, linelist is modified with select() but not over-written. New column names are only displayed for purpose of example.

Here are all the column names in the linelist:

names(linelist)
##  [1] "row_num"              "case_id"              "generation"          
##  [4] "date_infection"       "date_onset"           "date_hospitalisation"
##  [7] "date_outcome"         "outcome"              "gender"              
## [10] "hospital"             "lon"                  "lat"                 
## [13] "infector"             "source"               "age"                 
## [16] "age_unit"             "fever"                "chills"              
## [19] "cough"                "aches"                "vomit"               
## [22] "merged_column"        "x23"

With select() you can do the following:

Select only the columns you want to remain, and their order of appearance

# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names() # display the column names
## [1] "case_id"              "date_onset"           "date_hospitalisation"
## [4] "fever"

Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained. Inside select() you can use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove onset and all symptom columns
  names()
##  [1] "row_num"              "case_id"              "generation"          
##  [4] "date_infection"       "date_hospitalisation" "date_outcome"        
##  [7] "outcome"              "gender"               "hospital"            
## [10] "lon"                  "lat"                  "infector"            
## [13] "source"               "age"                  "age_unit"            
## [16] "merged_column"        "x23"

Re-order the columns - use everything() to signify all other columns not specified in the select() command:

# move case_id, date_onset, date_hospitalisation, and gender to beginning
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, gender, everything()) %>% 
  names()
##  [1] "case_id"              "date_onset"           "date_hospitalisation"
##  [4] "gender"               "row_num"              "generation"          
##  [7] "date_infection"       "date_outcome"         "outcome"             
## [10] "hospital"             "lon"                  "lat"                 
## [13] "infector"             "source"               "age"                 
## [16] "age_unit"             "fever"                "chills"              
## [19] "cough"                "aches"                "vomit"               
## [22] "merged_column"        "x23"

As well as everything() there are several special functions that work within select(), namely:

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those which are TRUE
  • starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
  • ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
  • contains() - columns containing a character string. Example: select(contains("time"))
  • matches() - to apply a regular expression (regex). Example: select(contains("[pt]al"))
  • num_range() -
  • any_of() - matches if column is named. Useful if the name might not exist. Example: select(any_of(date_onset, date_death, cardiac_arrest))

Here is an example using where():

# select columns containing certain characters
linelist %>% 
  select(contains("date")) %>% 
  names()
## [1] "date_infection"       "date_onset"           "date_hospitalisation"
## [4] "date_outcome"
# searched for multiple character matches
linelist %>% 
  select(matches("onset|hosp|fev")) %>%   # note the OR symbol "|"
  names()
## [1] "date_onset"           "date_hospitalisation" "hospital"            
## [4] "fever"

Adding select()to the cleaning pipe chain:

In the linelist, there are a few columns we do not need: row_num, merged_column, and x23. Remove them by adding a select() command to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_column, x23))

select() as a stand-alone command

Select can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.

# Create a new linelist with id and age-related columns
linelist_age <- select(linelist, case_id, contains("age"))

# display the column names
names(linelist_age)
## [1] "case_id"  "age"      "age_unit"

Add columns and rows

See the tabs below to add columns and rows

Add columns

mutate()

We advise creating new columns with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use mutate() or the base R style to create a new column (see below).

The verb mutate() is used to add a new column, or to modify an existing one. Below are some example of creating new columns with mutate(). The syntax is: new_column_name = value or function. It is best practice to separate each new column with a comma and new line.

linelist <- linelist %>%                       # creating new, or modifying old dataset
  mutate(new_var_dup    = case_id,             # new column = duplicate/copy another column
         new_var_static = 7,                   # new column = all values the same
         new_var_static = new_var_static + 5,  # you can overwrite a column, and it can be a calculation using other variables
         new_var_paste  = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
         ) 

Scroll to the right to see the new columns:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

TIP: The verb transmute() adds new columns just like mutate() but also drops/removes all other columns that you do not mention.

New columns using base R

To define a new column (or re-define a column) using base R, just use the assignment operator as below. Remember that when using base R you must specify the dataframe before writing the column name (e.g. dataframe$column). Here are two dummy examples:

linelist$old_var <- linelist$old_var + 7
linelist$new_var <- linelist$old_var + linelist$age

Add rows

TO DO

Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.

linelist <- linelist %>% 
  add_row(row_num = 666, case_id = "abc", generation = 4, `infection date` = as.Date("2020-10-10"), .before = 2)

use .before and .after. .before = 3 will put it before the 3rd row. Default is to add it to the end. columns not specified will be let empty. The new row number may look strange (“…23”) but the row numbers have changed. So if using the command twice examine/test carefully.

If your class is off you will see an error like this: Error: Can’t combine ..1$infection date and ..2$infection date . (for a date value remember to wrap the date in the functionas.Date() like as.Date("2020-10-10"))

New columns using grouped values

CAUTION: This tab may follow from previous tabs.

Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html

Taken from website above:

#Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as #soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:

starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:

starwars %>%
  select(name, mass, species) %>%
  group_by(species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.

Fix classes

CAUTION: This tab may follow from previous tabs.

See section on object classes

Often you will need to set the correct class for a column. The most common approach is to use mutate() to define the column as itself, but with a different class.

First we run some checks on the classes of important columns.

The class of the “age” column is character. To perform analysis, we need those numbers to be recognized as numeric!

class(linelist$age)
## [1] "character"

The class of the “date_onset” column is also character! To perform analysis, these dates must be recognized as dates!

class(linelist$date_onset)
## [1] "character"

However, if we try to classify this column as date, we would get an error. Use table() or sort or another method to examine all the values and identify different one. For example in our dataset we see that we see that one date_onset value was entered in a different format (15th April 2014) than all the other values!

## 
## 15th April 2014      2012-05-01      2012-06-17      2012-06-21      2012-06-24 
##               1               1               1               1               2 
##      2012-06-27 
##               1

Before we can classify “date_onset” as a date, this value must be fixed to be the same format as the others. You can fix the date in the source data, or, we can do in the cleaning pipeline via mutate() and recode(). This must be done before the commands to convert to class Date. (LINK TO DATE SECTION).

The mutate() line can be read as: “mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this for recoding.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_column, x23)) %>% 

# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################
  # fix incorrect values                 # old value       # new value
  mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
  
  # correct the class of the columns
  mutate(across(contains("date"), as.Date), 
         generation = as.numeric(generation),
         age        = as.numeric(age))
         
         

Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! For as.Date(), the format = argument is often a source of errors.

Fix class for multiple columns at once

class(linelist$date_infection)
## [1] "Date"
head(linelist$date_infection)
## [1] "2014-04-09" NA           NA           "2014-05-07" NA          
## [6] "2014-05-06"

You can use The dplyr function across() with mutate() to convert several columns at once to a new class. across() allows you to specify which columns you want a function to apply to. Below, we want to mutate the columns where is.POSIXct() (a type of date/time class that shows unnecessary timestamps) is TRUE, and apply the function is.Date() to them, in order to convert them to class “date”.

  • Note that within across() we also use the function where().
  • Note that is.POSIXct is from the package lubridate. Other similar functions (is.character(), is.numeric(), and is.logical()) are from base R
  • Note that the functions in across() are written without the empty parentheses ()
linelist <- linelist %>% 
  mutate(across(where(lubridate::is.POSIXct), as.Date))

Recoding values

blah blah blah TO DO

Manual recoding

Manual using mutate()

mutate() is also used to recode the values in a column. For example, in linelist the values in the column “hospital” must be cleaned. There are several incorrect spellings, and many missing values.

table(linelist$hospital, useNA = "always")
## 
##                            Connaught Hopital 
##                                           48 
##                           Connaught Hospital 
##                                         1757 
##                                   Hospital A 
##                                           54 
##                                   Hospital B 
##                                           54 
##                             Military Hopital 
##                                           31 
##                            Military Hospital 
##                                          802 
##                             Mitylira Hopital 
##                                            1 
##                            Mitylira Hospital 
##                                           80 
##                                        other 
##                                          905 
##  Princess Christian Maternity Hopital (PCMH) 
##                                           11 
## Princess Christian Maternity Hospital (PCMH) 
##                                          421 
##                               Rokupa Hopital 
##                                           11 
##                              Rokupa Hospital 
##                                          452 
##                                         <NA> 
##                                         1500

Manual using recode()

To change spellings manually, one-by-one, you can use the recode() function within the mutate function. The code is saying that the column “hospital” should be defined as the current column “hospital”, but with certain changes (the syntax is OLD = NEW). Don’t forget commas!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Connaught Hopital" = "Connaught Hospital",
                      "Rokupa Hopital"    = "Rokupa Hospital",
                      "other"             = "Other",
                      "Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
                      ))

Now we see the values in the hospital column have been corrected:

table(linelist$hospital, useNA = "always")
## 
##                           Connaught Hospital 
##                                         1805 
##                                   Hospital A 
##                                           54 
##                                   Hospital B 
##                                           54 
##                            Military Hospital 
##                                          914 
##                                        Other 
##                                          905 
## Princess Christian Maternity Hospital (PCMH) 
##                                          432 
##                              Rokupa Hospital 
##                                          463 
##                                         <NA> 
##                                         1500

TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space inbetween ("").

Manual using base R

If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new column by assigning it a value. In the command below, the column new_var does not exist until after the command is executed. In this simple example the column is assigned the static value “new value”, so for all rows the value will be “new value”.

linelist_raw$new_var <- "new value"

If necessary, you make manual changes to a specific value in a dataframe by referencing the row number of case ID. But remember it is better if you can make these changes permanently in the underlying data!

Here is a fake example. It reads as “Change the value of the dataframe linelist‘s column onset_date (for the row where linelist’s column case_id has the value ’9d4019’) to as.Date("2020-10-24")”.

linelist$date_onset[linelist$case_id == "9d4019"] <- as.Date("2020-10-24")

Recoding by logic/condition

Logic using case_when()

If you need to use logic statements to recode values, or want to use operators like %in%, use dplyr’s case_when() instead. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic order!

Note that all Right-hand side (RHS) inputs must be of the same class (e.g. character, numeric, logical). Notice the use of the special value NA_real_ instead of just NA.

linelist <- linelist %>% 
  mutate(age_years = case_when(
            age_unit == "years"  ~ age,     # if age is given in years
            age_unit == "months" ~ age/12,  # if age is given in months
            is.na(age_unit)      ~ age,     # if age unit is missing, assume years
            TRUE                 ~ NA_real_)) # Any other circumstance

Logic using ifelse() and if_else()

For simple uses of logical re-coding or new variable creationgyou can use ifelse() or if_else(). Though in most cases it is better to use case_when().

These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if condition evaluates to TRUE, value if condition evaluates to FALSE). If used in a mutate(), each row is evaluated. if_else() is a special version from dplyr that handles dates in the condition.

It can be tempting to string together many ifelse commands… resist this and use case_when() instead! It is much more simple, easier to read, and easier to identify errors.

IMAGE of ifelse string with X across is.

You can reference other columns with the ifelse() function within mutate():

Example of ifelse():

linelist <- linelist %>% 
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))

Example of if_else() (using dates): Note that if the ‘true’ value is a date, the ‘false’ value must also qualify a date, hence using the special character NA_real_ instead of just NA.

linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))

Note: If you want to alternate a value used in your code based on other circumstances, consider using switch() from base R. For example if… TO DO. See the section on using switch() in the page on R interactive console.

Recoding using special dplyr functions

Using replace_na()

To change missing values (NA) to a specific character value, such as “Missing”, use the function replace_na() within mutate(). Note that this is used in the same manner as recode above - the name of the variable must be repeated within replace_na().

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

Using na_if()

Likewise you can quickly convert a specific character value to NA using na_if(). The command below is the opposite of the one above. It converts any values of “Missing” to NA.

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

Using coalesce()

This dplyr function finds the first non-missing value at each position. So, you provide it with columns and for each row it will fill the value with the first non-missing value in the columns you provided.

For example, you might use thiscoalesce()` create a “location” variable from hypothetical variables “patient_residence” and “reporting_jurisdiction”, where you prioritize patient residence information, if it exists.

linelist <- linelist %>% 
  mutate(location = coalesce(patient_residence, reporting_jurisdiction))

TO DO lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),

Recoding using cleaning dictionaries

CAUTION: This tab may follow from previous tabs.

## load cleaning rules and only keep columns in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
  filter(column %in% c(names(mll_raw), ".global"))

## define columns that are not cleand
unchanged <- c(
  "epilink_relationship",
  "narratives",
  "epilink_relationship_detail"
)

mll_clean <- mll_raw %>%
  ## convert to tibble
  as_tibble() %>%
  ## clean columns using cleaning rules
  clean_data(
    wordlists = mll_cleaning_rules,
    protect = names(.) %in% unchanged
  )

Add recoding to the pipe chain

Here we add the described cleaning steps to the pipe chain.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_column, x23)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 

# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################

    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Connaught Hopital" = "Connaught Hospital",
                      "Rokupa Hopital"    = "Rokupa Hospital",
                      "other"             = "Other",
                      "Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_))

Filter rows

CAUTION: This tab may follow from previous tabs.

A typical early cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept.

The tabs below show how to filter rows based on simple and complex logical conditions, and how to filter/subset rows as a stand-alone command and with base R

A simple filter()

A simple example applies a filter command within a pipe chain. The command re-defines the dataframe linelist as itself having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses is TRUE are kept.
In this case, the logical statement is !is.na(case_id), which is asking whether the value in the column case_id is not missing (NA). Thus, rows where case_id is not missing are kept.

Before the filter is applied, the number of rows in linelist is 6127.

linelist <- linelist %>% 
  filter(!is.na(case_id))  # keep only rows where case_id is not missing

After the filter is applied, the number of rows in linelist is 6124.

A complex filter()

A more complex example using filter():

Examine the data

Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.

hist(linelist$date_onset, breaks = 50)

#### Be aware how filters handle missing numeric and date values

Can we just filter by onset_date to rows after June 2013? Caution! Applying filter(date_onset > as.Date("2013-06-01"))) would accidentally remove any rows in the later epidemic with a missing date of onset!

DANGER: Filtering to greater than (>) or less than (<) a date or number can remove any rows with missing values (NA)! This is because NA is treated as infinitely large and small.

Design the filter

What other criteria can we filter on to remove the first outbreak from the dataset? We also happen to know that this first epidemic occurred at Hospital A, Hospital B, and that there were also 10 cases at Connaught Hospital. Hospitals A & B did not have cases in the second epidemic, but Connaught Hospital had many. This is a complex filter to apply - it is wise to cross-tabulate these columns to know exactly how many rows we expect should be removed.

Let’s examine a cross-tabulation to make sure we exclude only the correct rows:

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of the date_onset
      useNA     = "always")                              # show missing values
##                                               YearOnset
## Hospital                                       2012 2013 2014 2015 <NA>
##   Connaught Hospital                              8    1 1377  339   80
##   Hospital A                                     34   18    0    0    2
##   Hospital B                                     39   12    0    0    3
##   Military Hospital                               0    0  676  199   39
##   Missing                                         0    0 1129  306   62
##   Other                                           0    0  687  176   42
##   Princess Christian Maternity Hospital (PCMH)    0    0  327   90   15
##   Rokupa Hospital                                 0    0  347   97   19
##   <NA>                                            0    0    0    0    0

We want to exclude only the nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01"))) rows from 2012 and 2013 at those three hospitals (A, B, and Connaught), including the 2 from Hospitals A & B with missing onset dates, but not any others with missing onset dates. We start with a linelist of nrow(linelist). Here is our filter statement:

linelist <- linelist %>% 
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
## [1] 6007

When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, the 10 Connaught Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of the date_onset
      useNA     = "always")                              # show missing values
##                                               YearOnset
## Hospital                                       2014 2015 <NA>
##   Connaught Hospital                           1377  339   80
##   Military Hospital                             676  199   39
##   Missing                                      1129  306   62
##   Other                                         687  176   42
##   Princess Christian Maternity Hospital (PCMH)  327   90   15
##   Rokupa Hospital                               347   97   19
##   <NA>                                            0    0    0

Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity. Adding these filters to the cleaning pipe chain now looks like this:

Filter as a stand-alone command

Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.

# dataframe <- filter(dataframe, condition(s) for rows to keep)

linelist <- filter(linelist, !is.na(case_id))

You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.

# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)

linelist <- linelist[!is.na(case_id), ]

TIP: Use bracket-subset syntax with View() to quickly review a few records.

Filtering to quickly review data

This base R syntax can be handy when you want to quickly view a subset of rows and columns. Use the base R View() command (note the capital “V”) around the [] subset you want to see. The result will appear as a dataframe in your RStudio viewer panel. For example, if I want to review onset and hospitalization dates of 3 specific cases:

View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])

Note: the above command can also be written with dplyr verbs filter() and select() as below:

View(linelist %>% filter(case_id %in% c("11f8ea", "76b97a", "47a5f5")) %>% select(date_onset, date_hospitalisation)

Add the filters to the cleaning pipe chain

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
        select(-c(row_num, merged_column, x23)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Connaught Hopital" = "Connaught Hospital",
                      "Rokupa Hopital"    = "Rokupa Hospital",
                      "other"             = "Other",
                      "Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years"  ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit)      ~ age,
          TRUE                 ~ NA_real_)) %>% 
    
  # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

Numeric categories

CAUTION: This tab may follow from previous tabs.

Special approaches for creating numeric categories

Common examples include age categories, groups of lab values, etc.

There are several ways to create categories of a numeric column such as age. Here we will discuss:

  • cut(), from base R
  • age_categories(), from the epikit package
  • using percentiles to break your numbers
  • natural break points… ? TO DO
  • case_when()

Sometimes, numeric variables will import as class “character”. This occurs if there are non-numeric characters in some of the values, for example an entry of “2 months” for age, or (depending on your R locale settings) if a comma is used in the decimals place (e.g. “4,5” to mean four and one half years).

For this example we will create an age_cat column using the age_years column.

#check the class of the linelist variable age
class(linelist$age_years)
## [1] "numeric"

Numeric categories using cut()

You can use the base function cut(), which creates categories from a numeric variable.

The basic syntax within cut() is to first provide the numeric variable to be cut (age_years), and then the breaks argument, which is a numeric vector (c()) of break points. Using cut(), the resulting column is automatically a ordered factor.

If used within mutate() (a dplyr verb) it is not necessary to specify the dataset before the column name (e.g. linelist$age_years).

Quick example

Create new column of age categories (age_cat) by cutting the numeric age_year column at specified break points:

  • Specify numeric vector of break points c(0, 5, 10, 15, ...)
  • Default behavior for cut() is that lower break values are excluded from each category, and upper break values are included
  • Include 0 in the lowest category by adding include.lowest = TRUE
  • Add a vector of customized labels using the labels = argument
  • Be aware of and address missing values
linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,                     # numeric column
                        breaks = c(0, 5, 10, 15, 20,
                                   30, 50, 70, 100),  # break points for categories
                                                      # default: lower breaks excluded and upper included
                        include.lowest = TRUE,        # include 0 in lowest category
                        labels = c("0-5", "6-10", "11-15", "16-20",       # manual labels - be careful!
                                   "21-30", "31-50", "51-70", "71-100")))       

table(linelist$age_cat, useNA = "always")
## 
##    0-5   6-10  11-15  16-20  21-30  31-50  51-70 71-100   <NA> 
##   1353   1089    985    814   1048    600     27      1     90

Detailed example

Below is a detailed description of the behavior of using cut() to make the age_cat column:

  • Inclusion/exclusion behavior of break points
  • Custom category labels
  • Handling missing values
  • Check your work!

The most basic, simple version of cut() applied to age_years to make the new variable age_cat:

# Create new variable, by cutting the numeric age variable
# by default, upper break is excluded and lower break excluded from each category
linelist <- linelist %>% 
  mutate(age_cat = cut(age_years, breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100)))

# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
## 
##    (0,5]   (5,10]  (10,15]  (15,20]  (20,30]  (30,50]  (50,70] (70,100] 
##     1230     1089      985      814     1048      600       27        1 
##     <NA> 
##      213

Inclusion/exclusion criteria for break values

By default, the grouping occurs so that the right/upper side is “open” and inclusive (and the left/lower side is “closed” or exclusive). The default labels use the notation “(A, B]”, which means the group does not include A (the lower break value), but includes B (the upper break value). You can reverse this behavior by providing the right argument and setting equal to FALSE (explanation below).

Check value assignment

Check your work!!! Verify that each age value was assigned to the correct category. Note how above, 213 rows are categorized as NA. Should all of these be assigned NA? Check by cross-tabulating the age_years and age_cat columns (names added for clarity).

You can also see in the cross-tabulation below that “0” values were assigned as missing (NA)! This is because the command was by default “closed” on the lower side of each category. These “0” values could be infants coded as age 0. Read more how to address this situation below.

Also check category boundary values are assigned to. For example the 213 15-year-olds. Currently they are in the “(10,15]” category which includes ages 11-15, and they are not in the "(15,20] category which includes ages 16-20.

# Cross tabulation of the numeric and category columns. Names specified for clarity.
table("Numeric Values" = linelist$age_years,
      "Categories" = linelist$age_cat,
      useNA = "always")
##                    Categories
## Numeric Values      (0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70]
##   0                     0      0       0       0       0       0       0
##   0.166666666666667     2      0       0       0       0       0       0
##   0.25                  2      0       0       0       0       0       0
##   0.333333333333333     5      0       0       0       0       0       0
##   0.416666666666667     2      0       0       0       0       0       0
##   0.5                   2      0       0       0       0       0       0
##   0.583333333333333     1      0       0       0       0       0       0
##   0.75                  1      0       0       0       0       0       0
##   0.833333333333333     3      0       0       0       0       0       0
##   0.916666666666667     3      0       0       0       0       0       0
##   1                   254      0       0       0       0       0       0
##   1.5                   1      0       0       0       0       0       0
##   2                   252      0       0       0       0       0       0
##   3                   234      0       0       0       0       0       0
##   4                   225      0       0       0       0       0       0
##   5                   243      0       0       0       0       0       0
##   6                     0    213       0       0       0       0       0
##   7                     0    240       0       0       0       0       0
##   8                     0    210       0       0       0       0       0
##   9                     0    208       0       0       0       0       0
##   10                    0    218       0       0       0       0       0
##   11                    0      0     203       0       0       0       0
##   12                    0      0     192       0       0       0       0
##   13                    0      0     189       0       0       0       0
##   14                    0      0     188       0       0       0       0
##   15                    0      0     213       0       0       0       0
##   16                    0      0       0     161       0       0       0
##   17                    0      0       0     185       0       0       0
##   18                    0      0       0     171       0       0       0
##   19                    0      0       0     149       0       0       0
##   20                    0      0       0     148       0       0       0
##   21                    0      0       0       0     143       0       0
##   22                    0      0       0       0     132       0       0
##   23                    0      0       0       0     122       0       0
##   24                    0      0       0       0     124       0       0
##   25                    0      0       0       0     117       0       0
##   26                    0      0       0       0      90       0       0
##   27                    0      0       0       0     100       0       0
##   28                    0      0       0       0      83       0       0
##   29                    0      0       0       0      66       0       0
##   30                    0      0       0       0      71       0       0
##   31                    0      0       0       0       0      78       0
##   32                    0      0       0       0       0      49       0
##   33                    0      0       0       0       0      59       0
##   34                    0      0       0       0       0      54       0
##   35                    0      0       0       0       0      52       0
##   36                    0      0       0       0       0      39       0
##   37                    0      0       0       0       0      39       0
##   38                    0      0       0       0       0      32       0
##   39                    0      0       0       0       0      31       0
##   40                    0      0       0       0       0      28       0
##   41                    0      0       0       0       0      24       0
##   42                    0      0       0       0       0      19       0
##   43                    0      0       0       0       0      24       0
##   44                    0      0       0       0       0      17       0
##   45                    0      0       0       0       0      11       0
##   46                    0      0       0       0       0       9       0
##   47                    0      0       0       0       0       8       0
##   48                    0      0       0       0       0      15       0
##   49                    0      0       0       0       0       9       0
##   50                    0      0       0       0       0       3       0
##   51                    0      0       0       0       0       0       3
##   52                    0      0       0       0       0       0       6
##   53                    0      0       0       0       0       0       2
##   54                    0      0       0       0       0       0       4
##   55                    0      0       0       0       0       0       2
##   57                    0      0       0       0       0       0       2
##   58                    0      0       0       0       0       0       1
##   59                    0      0       0       0       0       0       2
##   60                    0      0       0       0       0       0       1
##   61                    0      0       0       0       0       0       2
##   62                    0      0       0       0       0       0       1
##   64                    0      0       0       0       0       0       1
##   72                    0      0       0       0       0       0       0
##   <NA>                  0      0       0       0       0       0       0
##                    Categories
## Numeric Values      (70,100] <NA>
##   0                        0  123
##   0.166666666666667        0    0
##   0.25                     0    0
##   0.333333333333333        0    0
##   0.416666666666667        0    0
##   0.5                      0    0
##   0.583333333333333        0    0
##   0.75                     0    0
##   0.833333333333333        0    0
##   0.916666666666667        0    0
##   1                        0    0
##   1.5                      0    0
##   2                        0    0
##   3                        0    0
##   4                        0    0
##   5                        0    0
##   6                        0    0
##   7                        0    0
##   8                        0    0
##   9                        0    0
##   10                       0    0
##   11                       0    0
##   12                       0    0
##   13                       0    0
##   14                       0    0
##   15                       0    0
##   16                       0    0
##   17                       0    0
##   18                       0    0
##   19                       0    0
##   20                       0    0
##   21                       0    0
##   22                       0    0
##   23                       0    0
##   24                       0    0
##   25                       0    0
##   26                       0    0
##   27                       0    0
##   28                       0    0
##   29                       0    0
##   30                       0    0
##   31                       0    0
##   32                       0    0
##   33                       0    0
##   34                       0    0
##   35                       0    0
##   36                       0    0
##   37                       0    0
##   38                       0    0
##   39                       0    0
##   40                       0    0
##   41                       0    0
##   42                       0    0
##   43                       0    0
##   44                       0    0
##   45                       0    0
##   46                       0    0
##   47                       0    0
##   48                       0    0
##   49                       0    0
##   50                       0    0
##   51                       0    0
##   52                       0    0
##   53                       0    0
##   54                       0    0
##   55                       0    0
##   57                       0    0
##   58                       0    0
##   59                       0    0
##   60                       0    0
##   61                       0    0
##   62                       0    0
##   64                       0    0
##   72                       1    0
##   <NA>                     0   90

Include 0 in lowest category

One solution to categorize the “0” values in the lowest category is to add the argument include.lowest = TRUE. Then, any “0” values are still included in the lowest group. Note how the total for the lowest group changed, and that the automatically-generated label for the lowest category changes from “(0,5]” to “[0,5]”, which signifies that 0 values are included. Also note there are still 90 rows missing an age_years value, as expected.

linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),    # same breaks
                          include.lowest = TRUE))                            # Includes lowest value (0) in lowest group

table(linelist$age_cat, useNA = "always")
## 
##    [0,5]   (5,10]  (10,15]  (15,20]  (20,30]  (30,50]  (50,70] (70,100] 
##     1353     1089      985      814     1048      600       27        1 
##     <NA> 
##       90

Customized category labels

Customized category labels** can be added manually with the labels argument. Be careful in your manual labeling! Pay attention to your inclusion/exclusion criteria!

linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),    # same breaks
                          include.lowest = TRUE,                            # Includes lowest value (0) in lowest group
                          labels = c("0-5", "6-10", "11-15", "16-20",       # manual labels - be careful!
                                     "21-30", "31-50", "51-70", "71-100")))

table(linelist$age_cat, useNA = "always")
## 
##    0-5   6-10  11-15  16-20  21-30  31-50  51-70 71-100   <NA> 
##   1353   1089    985    814   1048    600     27      1     90

Reversing break inclusion behavior

Lower break values will be included in each category (and upper break values excluded) if the argument right = is included and and set to TRUE. This is applied below - note how the values have shifted among the categories.

NOTE: If you include the include.lowest = TRUE argument and right = TRUE, the include.lowest will now apply to the highest break point value and category, not the lowest.

linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),     # same breaks
                          right = FALSE,                                     # include each *lower* break point            
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")))  # now the labels must change

table(linelist$age_cat, useNA = "always")
## 
##    0-4    5-9  10-14  15-19  20-29  30-49  50-69 70-100   <NA> 
##   1110   1114    990    879   1125    668     30      1     90

Re-labeling NA values with cut()

Because cut() does not automatically label NA values, you may want to assign a label such as “Missing”. This requires a few extra steps because cut() automatically classified age_cat as a Factor (a rigid column class with specific value labels).

First, convert age_cut from Factor to Character class, so you have flexibility to add new character values (“Missing”). Otherwise you will encounter an error. Then use the dplyr verb replace_na() to replace NA values with a character value like “Missing”. These can be combined into one step, as below.

Note that Missing has been added, but the order of the categories is now wrong (alphabetical).

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,                                                      
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"))


table(linelist$age_cat, useNA = "always")
## 
##     0-4   10-14   15-19   20-29   30-49     5-9   50-69  70-100 Missing    <NA> 
##    1110     990     879    1125     668    1114      30       1      90       0

To fix this, re-convert age_cat to a factor, and define the order of the levels correctly.

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,                                                      
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"),
         
         # re-classify age_cat as Factor, with correct level order and new "Missing" level
         age_cat = factor(age_cat, levels = c("0-4", "5-9", "10-14", "15-19", "20-29",
                                              "30-49", "50-69", "70-100", "Missing")))    
  

table(linelist$age_cat, useNA = "always")
## 
##     0-4     5-9   10-14   15-19   20-29   30-49   50-69  70-100 Missing    <NA> 
##    1110    1114     990     879    1125     668      30       1      90       0

Dynamic maximum break

DANGER: If you provide a highest break value that is too low, values may be excluded accidentally! You can write code that automatically adapts, by replacing a static highest number with the max() function.

Don’t forget to include the na.rm = TRUE argument to max().

linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, max(age_years, na.rm=TRUE)),
                          include.lowest = TRUE))

table(linelist$age_cat, useNA = "always")
## 
##   [0,5]  (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,72]    <NA> 
##    1353    1089     985     814    1048     600      27       1      90

If you want a fast way to make breaks and labels, you can use something like below (adjust to your specific situation). See the page on using seq() and rep() and c() TO DO

# Make break points from 0 to 90 by 5
age_seq = seq(from = 0, to = 90, by = 5)
age_seq

# Make labels for the above categories, assuming default cut() settings
age_labels = paste0(age_seq+1, "-", age_seq + 5)
age_labels

# check that both vectors are the same length
length(age_seq) == length(age_labels)

# # Use them in the cut() command
# cut(linelist$age, breaks = age_seq, labels = age_labels)

Numeric categories using case_when()

The dplyr function case_when() can also be used to create numeric categories.

  • Allows explicit setting of break point inclusion/exclusion
  • Allows designation of label for NA values in one step
  • More complicated code, arguably more prone to error
  • Allow more flexibility to include other variables in the logic

If using case_when() please review the in-depth page on it, as the logic and order of assignment are important understand to avoid errors.

CAUTION: In case_when() all right-hand side values must be of the same class. Thus, if your categories are character values (e.g. “20-30 years”) then any designated outcome for NA age values must also be character (“Missing”, or the special NA_character_ instead of NA).

You will need to designate the column as a factor (by wrapping case_when() in the function factor()) and provide the ordering of the factor levels using the levels = argument after the close of the case_when() function. When using cut(), the factor and ordering of levels is done automatically.

linelist <- linelist %>% 
  mutate(age_cat = factor(case_when(
          # provide the case_when logic and outcomes
          age_years >= 0 & age_years < 5     ~ "0-4",          # logic by age_year value
          age_years >= 5 & age_years < 10    ~ "5-9",
          age_years >= 10 & age_years < 15   ~ "10-14",
          age_years >= 15 & age_years < 20   ~ "15-19",
          age_years >= 20 & age_years < 30   ~ "20-29",
          age_years >= 30 & age_years < 50   ~ "30-49",
          age_years >= 50 & age_years < 70   ~ "50-69",
          age_years >= 45 & age_years <= 100 ~ "70-100",
          is.na(age_years)                   ~ "Missing",  # if age_years is missing
          TRUE                               ~ "Check value"   # catch-all alarm to trigger review
          ), levels = c("0-4","5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70-100", "Missing", "Check value"))
         )


table(linelist$age_cat, useNA = "always")
## 
##         0-4         5-9       10-14       15-19       20-29       30-49 
##        1110        1114         990         879        1125         668 
##       50-69      70-100     Missing Check value        <NA> 
##          30           1          90           0           0

Add numeric categories to pipe chain

Below, code to create two categorical age columns is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
        select(-c(row_num, merged_column, x23)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Connaught Hopital" = "Connaught Hospital",
                      "Rokupa Hopital"    = "Rokupa Hospital",
                      "other"             = "Other",
                      "Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
    
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################   
    mutate(
        # age categories column
        age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20,
                                     30, 50, 70, max(age_years, na.rm = T)), # breaks
                          include.lowest = TRUE,                             # include 0s in lowest category
                          labels = c("0-5", "6-10", "11-15", "16-20",       # manual labels - be careful!
                                     "21-30", "31-50", "51-70", "71+")),
         # column with 5-year age bins
         age_cat5 = cut(age_years,
                        breaks = seq(0, 100, 5),
                        include.lowest = T))
  

Highest in hierarchy

CAUTION: This tab may follow from previous tabs.

Within a group, indicate/convert to the highest value in the group

Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)

rowwise() dplyr()

https://cran.r-project.org/web/packages/dplyr/vignettes/rowwise.html


linelist <- linelist %>%
  rowwise() %>%
  mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes"))

Transforming multiple variables at once

CAUTION: This tab may follow from previous tabs.

across dplyr

A transformation can be applied to multiple variables at once using the across() function from the package dplyr (contained within tidyverse package).

across() can be used with any dplyr verb, but commonly with as mutate(), filter(), or summarise(). Here are some examples to get started.

across() with mutate():

Change all columns to character class

#to change all columns to character class
linelist <- linelist %>% 
  mutate(across(everything(), as.character))

Change only numeric columns ```

Here are a few online resources on using across(): Hadley Wickham’s thoughts/rationale